A Short Introduction to the Pandas Package

Pandas is the Python Data Analysis Library, built on top of numpy and part of the same scientific-Python ecosystem as scipy and IPython. It provides two data structures well suited to term-document matrices and other labelled tabular data: the Series and the DataFrame. Let's jump right in.


In [2]:
import pandas as pd #the recommended import convention for pandas
import numpy as np
sentence = 'the dog bit the man' #our first sentence from the presentation
token_list = sentence.split()
type_list = list(set(token_list)) #we only want each word type listed once
print(type_list)
print(token_list)


['man', 'the', 'bit', 'dog']
['the', 'dog', 'bit', 'the', 'man']

Now, let's initialize a Series with these index labels.


In [3]:
ser1 = pd.Series(index = type_list, dtype = 'float64') #an empty Series indexed by our word types
print(ser1)


man   NaN
the   NaN
bit   NaN
dog   NaN
dtype: float64
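
As an aside, once the Series exists you can also fill in values one at a time by indexing with its labels. A minimal sketch (the assignment below is purely illustrative and not part of the walkthrough):


In [ ]:
ser1['the'] = 2.0   # assign a value by its index label
print(ser1['the'])  # and read it back the same way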

If we want to initialize it with values, we can do this with the data argument or with a dictionary. First, let's get the counts for each word in the sentence.


In [4]:
count_list = []
count_dict = {}
for word in type_list:
    count_list.append(token_list.count(word))
    count_dict[word] = token_list.count(word)
print(count_list)
print(type_list)
print(count_dict)


[1, 2, 1, 1]
['man', 'the', 'bit', 'dog']
{'man': 1, 'the': 2, 'bit': 1, 'dog': 1}

Now, we can create our Series objects.


In [5]:
ser2 = pd.Series(data = count_list, index = type_list)
ser3 = pd.Series(count_dict)
print(ser2)
print(ser3)


man    1
the    2
bit    1
dog    1
dtype: int64
bit    1
dog    1
man    1
the    2
dtype: int64

A Series is essentially a labelled vector, here a frequency term-document vector. In order to construct a term-document matrix, we can create another Series for our second sentence.
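
Before moving on, here is a quick illustration of what the labels buy us: lookups by word, and arithmetic that aligns on labels rather than positions. (The addition below is purely illustrative.)


In [ ]:
print(ser2['the'])  # look up the count for a single word by its label
print(ser2 + ser3)  # addition aligns on the word labels, not on positions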

Quiz:

Below, write a short function that takes as its input a string of text and outputs a dictionary of word counts and a term-document Series.


In [8]:
sent2 = 'the bat hit the ball'
from collections import Counter

def td_Series(text):
    # insert your code here to create a count dictionary and a term-document vector for sent2
    counter = Counter(text.split())
    return counter, pd.Series(counter)

count_dict2, ser4 = td_Series(sent2)
print(ser4)
print(count_dict2)


ball    1
bat     1
hit     1
the     2
dtype: int64
Counter({'the': 2, 'bat': 1, 'hit': 1, 'ball': 1})

At this point, we have two separate Series representing two different term-document vectors. We can bring them together to create a DataFrame, the primary object type in the Pandas package.


In [9]:
df1 = pd.DataFrame(data = [ser3, ser4], index = ['sent1', 'sent2'])
print(df1) 
# if you just evaluate the DataFrame instead of printing it, the notebook renders a nicely formatted HTML view of the table:
df1


       ball  bat  bit  dog  hit  man  the
sent1   NaN  NaN    1    1  NaN    1    2
sent2     1    1  NaN  NaN    1  NaN    2

[2 rows x 7 columns]
Out[9]:
       ball  bat  bit  dog  hit  man  the
sent1   NaN  NaN    1    1  NaN    1    2
sent2     1    1  NaN  NaN    1  NaN    2

2 rows × 7 columns

Notice that we now have an $m \times n$ term-document matrix. We could also create the DataFrame by passing our count dictionaries directly. In this DataFrame, let's also replace all NaN values with 0.


In [14]:
df2 = pd.DataFrame(data = [count_dict, count_dict2], index = ['sent1', 'sent2'])
print(df2)
df2 = df2.fillna(value = 0)
df2


       ball  bat  bit  dog  hit  man  the
sent1   NaN  NaN    1    1  NaN    1    2
sent2     1    1  NaN  NaN    1  NaN    2

[2 rows x 7 columns]
Out[14]:
       ball  bat  bit  dog  hit  man  the
sent1     0    0    1    1    0    1    2
sent2     1    1    0    0    1    0    2

2 rows × 7 columns

Now we can look up values simply by naming (row, column) label pairs. Name the row first, then the column.


In [16]:
print(df1.loc['sent1', 'ball'])
print(df1.loc['sent2', 'ball'])
print(df2.loc['sent1', 'ball'])
print(df2.loc['sent2', 'ball'])
# or do it like this:
print(df1.ball.sent1)
df1


nan
1.0
0.0
1.0
nan
Out[16]:
       ball  bat  bit  dog  hit  man  the
sent1   NaN  NaN    1    1  NaN    1    2
sent2     1    1  NaN  NaN    1  NaN    2

2 rows × 7 columns

We can also access values by their integer row and column positions. Again, first the row, then the column.


In [17]:
df1.iloc[0, 0]


Out[17]:
nan

In [18]:
df1.iloc[1, 0]


Out[18]:
1.0

In [19]:
df2.iloc[0, 0]


Out[19]:
0.0

In [20]:
df2.iloc[1, 0]


Out[20]:
1.0

In [21]:
df2.iloc[0]


Out[21]:
ball    0
bat     0
bit     1
dog     1
hit     0
man     1
the     2
Name: sent1, dtype: float64

In [22]:
df2.index


Out[22]:
Index(['sent1', 'sent2'], dtype='object')

In [23]:
df2.values # which will return a numpy 2d array


Out[23]:
array([[ 0.,  0.,  1.,  1.,  0.,  1.,  2.],
       [ 1.,  1.,  0.,  0.,  1.,  0.,  2.]])

Below are a few other things you can do with a DataFrame.


In [24]:
df2.min(axis = 0)


Out[24]:
ball    0
bat     0
bit     0
dog     0
hit     0
man     0
the     2
dtype: float64

In [25]:
df2.min(axis = 1)


Out[25]:
sent1    0
sent2    0
dtype: float64

In [26]:
np.min(df2, axis = 1) # numpy function works but is slightly slower


Out[26]:
sent1    0
sent2    0
dtype: float64

In [27]:
df2.max(axis = 1)


Out[27]:
sent1    2
sent2    2
dtype: float64

In [28]:
df2.idxmin(axis = 1) # index of the min


Out[28]:
sent1    ball
sent2     bit
dtype: object

In [29]:
df2.idxmax(axis = 1) # index of the max


Out[29]:
sent1    the
sent2    the
dtype: object

In [30]:
df2.values.max() # max of all of the values


Out[30]:
2.0

And simple statistics.


In [31]:
df2.describe()


Out[31]:
           ball       bat       bit       dog       hit       man  the
count  2.000000  2.000000  2.000000  2.000000  2.000000  2.000000    2
mean   0.500000  0.500000  0.500000  0.500000  0.500000  0.500000    2
std    0.707107  0.707107  0.707107  0.707107  0.707107  0.707107    0
min    0.000000  0.000000  0.000000  0.000000  0.000000  0.000000    2
25%    0.250000  0.250000  0.250000  0.250000  0.250000  0.250000    2
50%    0.500000  0.500000  0.500000  0.500000  0.500000  0.500000    2
75%    0.750000  0.750000  0.750000  0.750000  0.750000  0.750000    2
max    1.000000  1.000000  1.000000  1.000000  1.000000  1.000000    2

8 rows × 7 columns


In [32]:
df2.mean(axis = 1)


Out[32]:
sent1    0.714286
sent2    0.714286
dtype: float64

In [33]:
df2.loc['sent1'].mean()


Out[33]:
0.7142857142857143

In [34]:
df2.std(axis = 1) # standard deviation


Out[34]:
sent1    0.755929
sent2    0.755929
dtype: float64

Now, what can we do with this? We can, for example, use the correlation method built into Pandas.


In [35]:
df2.iloc[0].corr(df2.iloc[1])


Out[35]:
0.12499999999999988
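
With more than two documents, computing each pairwise correlation by hand gets tedious. DataFrame.corr works column-wise, so one way (sketched here, not run above) is to transpose first and get the whole document-by-document correlation matrix in one call:


In [ ]:
df2.T.corr()  # correlations between the columns of df2.T, i.e. between sent1 and sent2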

Or, if we have the scikit-learn package, there is a lot more we can do.

Note: to install scikit-learn on Linux with Python 3.4, use the following command: [sudo] pip3 install git+https://github.com/scikit-learn/scikit-learn.git

The tf-idf metric stands for 'term frequency-inverse document frequency'. It weights the importance of each word for each document, based on how often the word occurs in that document and on the inverse of the number of documents in the corpus that contain it.


In [36]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer().fit_transform(df2)
df_tfidf = pd.DataFrame(data = tfidf.toarray(), index = df2.index, columns = df2.columns)
print(df_tfidf)


           ball       bat       bit       dog       hit       man       the
sent1  0.000000  0.000000  0.446101  0.446101  0.000000  0.446101  0.634809
sent2  0.446101  0.446101  0.000000  0.000000  0.446101  0.000000  0.634809

[2 rows x 7 columns]
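
To see roughly where these numbers come from, here is a numpy sketch of what TfidfTransformer computes under its default settings (smoothed idf and L2 row normalization). This is only an illustration of the formula, not the library's actual implementation:


In [ ]:
counts = df2.values                      # raw term frequencies, shape (n_docs, n_terms)
n_docs = counts.shape[0]
df_t = (counts > 0).sum(axis = 0)        # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df_t)) + 1   # smoothed inverse document frequency
tfidf_manual = counts * idf              # weight each term frequency by its idf
tfidf_manual /= np.linalg.norm(tfidf_manual, axis = 1, keepdims = True)  # L2-normalize each row
print(pd.DataFrame(tfidf_manual, index = df2.index, columns = df2.columns))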

You can also measure the distance between two documents with the pairwise_distances function in sklearn.


In [37]:
from sklearn.metrics.pairwise import pairwise_distances
euclid = pairwise_distances(df2) #Euclidean distance between the two documents.
df_euclid = pd.DataFrame(data = euclid, index = df2.index, columns = df2.index)
print(df_euclid)


         sent1    sent2
sent1  0.00000  2.44949
sent2  2.44949  0.00000

[2 rows x 2 columns]
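
pairwise_distances also takes a metric argument, so the same call can produce other distance measures. A small sketch using cosine distance as one example:


In [ ]:
cosine = pairwise_distances(df2, metric = 'cosine')  # cosine distance instead of Euclidean
print(pd.DataFrame(cosine, index = df2.index, columns = df2.index))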

Quiz:
Now it's your turn. There are many texts in the Data sub-directory of this directory. Write a function that takes a text file's path as input, reads the text from the file, splits it into its individual words, and returns a Series with the word types (i.e., unique words) as the index and the number of times they occur as the values.


In [41]:
def split_txt(filename):
    # write your code here
    with open(filename) as f:
        my_file = f.read()
    d, s = td_Series(my_file)
    return s

emma_Series = split_txt('./Data/austen-emma.txt')
print(emma_Series[:20])


"'Tis              1
"--Mrs.            1
"A                13
"A.                1
"About             1
"Agreed,           1
"Ah!              27
"Ah!"              3
"Ah!--(shaking     1
"Ah!--Indeed       1
"Ah!--so           1
"Ah!--well--to     1
"Ah,               2
"Almost            1
"And              45
"And,              3
"Another           1
"Are               4
"As                8
"At                1
dtype: int64

Take a look at the first 20 members of the Series. It looks like we have a couple of problems: capitalization and punctuation. Edit your function below to solve these problems.
Hint: use the punctuation constant in the string module to recognize punctuation.


In [42]:
from string import punctuation
print(punctuation)


!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

In [45]:
from string import punctuation
import re
def split_txt(filename):
    # write your code here
    with open(filename, encoding = 'utf-8') as f:
        my_file = f.read()
    #lowercase here
    my_file = my_file.lower()
    #strip punctuation here
    for c in punctuation:
        my_file = my_file.replace(c, ' ')
    d, s = td_Series(my_file)
    return s

emma_Series = split_txt('./Data/austen-emma.txt')
'''
The following code checks whether you have successfully cleaned your corpus.
Please do not change it.
'''
problems = []
for word in emma_Series.index:
    if re.search('[\WA-Z]', word):
        problems.append(word)
print(len(problems))


0

In [46]:
print(emma_Series[:20]) # take another look at the first 20 entries of the cleaned Series


000             2
10              2
1816            1
23rd            1
24th            1
26th            1
28th            2
7th             1
8th             1
a            3130
abbey          31
abbots          1
abdy            1
abhor           1
abhorred        1
abide           1
abilities       3
able           72
abode           1
abolition       1
dtype: int64

If the length of the problems list is not 0, then you are not yet finished. Take a look at your results to check what you did wrong and edit your code to correct the problem.

You now have a function that can take a text, clean it, and produce a term-document array (Series). Now, you should integrate this function into a script that will read and clean all the texts in the ./Data folder. You should then integrate all of the resulting Series into one large term-document matrix. Transform this matrix into a tf-idf matrix, and then run at least 5 of the metrics under pairwise_distances in sklearn.


In [47]:
from os import listdir

texts = listdir('./Data')
texts


Out[47]:
['austen-emma.txt',
 'austen-pride.txt',
 'austen-sense.txt',
 'blake-poems.txt',
 'blake-songs.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-piazza.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'whitman-leaves.txt',
 'whitman-patriotic.txt',
 'whitman-poems.txt']

In [48]:
#Write your code here or in a separate .py file. __Make sure I know where to find your file!__
from os import listdir

texts = listdir('./Data')
s_list = []
for f in texts:
    s_list.append(split_txt('/'.join(['./Data', f])))
td_df = pd.DataFrame(s_list, index = texts).fillna(0)
td_df


Out[48]:
(truncated preview of td_df: the rows are the 18 text files and the columns are the 32519 word types; the first columns happen to be numeric tokens such as '0', '00', '000', '00021053', ...)

18 rows × 32519 columns


In [51]:
len(td_df.iloc[0])


Out[51]:
32519
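
One possible sketch of the remaining steps, reusing the objects built above: tf-idf weight the full term-document matrix, then compare the texts under several of the metrics that pairwise_distances accepts. (The particular metrics chosen here are only an example, not the required answer.)


In [ ]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import pairwise_distances

tfidf_all = TfidfTransformer().fit_transform(td_df).toarray()  # dense tf-idf matrix
for metric in ['euclidean', 'cosine', 'cityblock', 'chebyshev', 'correlation']:
    dist = pairwise_distances(tfidf_all, metric = metric)
    print(metric)
    print(pd.DataFrame(dist, index = td_df.index, columns = td_df.index))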

Consider your results from each of these different metrics. Is there anything that suggests which of these metrics is better suited to analyzing this data?

Write your answer in this text box, below this line.

Your answer:

blahblahblah